df |>
  mutate(gender = ifelse(gender == "m", 1, 0))

Lecture 10: Exploratory Data Analysis and Visualization, Part II
2024-12-12
dplyr (Source: Intro to R for Social Scientists)

- ifelse(test, yes, no) returns a value with the same shape as the logical test, filled with elements selected from either yes or no depending on whether each element of test is TRUE or FALSE.
- case_when() vectorises multiple ifelse() statements; it is the dplyr equivalent of if...else.
- stringr: str_replace(), str_detect(), etc.
- tolower() and trimws()

The ggplot2 package is an implementation of Leland Wilkinson’s ‘Grammar of Graphics’.
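A quick recap sketch contrasting ifelse() and case_when(), using a hypothetical toy data frame (not the lecture data):

```r
library(dplyr)

df <- data.frame(gender = c("m", "f", "f", "m"))

# ifelse(): a single condition with one yes/no pair
df |> mutate(gender_num = ifelse(gender == "m", 1, 0))

# case_when(): several conditions, evaluated top to bottom
df |> mutate(gender_lab = case_when(
  gender == "m" ~ "male",
  gender == "f" ~ "female",
  .default = NA_character_
))
```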
ggplot2 is so good that it has become THE reference. [In Python, use plotnine to apply the grammar of graphics.]
Example from A Comprehensive Guide to the Grammar of Graphics for Effective Visualization of Multi-dimensional Data using the built-in mtcars dataset in R.
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Which code produced the figure? (This would not appear on the exam, as it requires specific knowledge of geom_bar(); solve it with R.)
The exam has the same format as the quizzes and the mock exam: True/False, multiple-choice, and multiple-correct questions. These are designed to test your understanding of the material.
There will also be 3-4 essay-style questions aimed at evaluating your ability to apply your knowledge to new situations.
You will not be required to write exact R code, but you should be able to interpret and understand the code provided in the exam.
I expect you to be familiar with all R commands and concepts covered in the lectures, exercises, in-class code, and additional practice exercises.
The readings are not mandatory for the exam. The focus will be on the material discussed in class and during the exercises.
Values are represented by their position relative to the axes: line charts and scatterplots.
Values are represented by the size of an area: bar charts and area charts.
Values are continuous: use a chart type that visually connects elements (line chart).
Values are categorical: use a chart type that visually separates elements (bar chart).
(Source: Data Visualization Basics for Economists)
“Greatest number of ideas in the shortest time with the least ink in the smallest space” (Edward Tufte, 1983)
Recommendations from Edward Tufte’s “The Visual Display of Quantitative Information” (1983)
We can quantify the Lie Factor of a graph as a measure of how much the graphic distorts the data.
“The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented.” (Tufte, 1983)
Lie Factor = \(\frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}\)
Lie Factor = \(\frac{\text{Yang had 39.1% of total ink}}{\text{Yang had 22.5%}}\) = 1.74
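The arithmetic behind this Lie Factor, spelled out in R:

```r
# Lie Factor = size of effect shown in graphic / size of effect in data
size_in_graphic <- 39.1  # Yang's share of the total ink (%)
size_in_data    <- 22.5  # Yang's actual share (%)
round(size_in_graphic / size_in_data, 2)  # 1.74
```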
“All variations lead to overestimation of small values and underestimation of large ones.” (Kosara et al., 2018)
Data-ink Ratio = \(\frac{\text{ink used for data points}}{\text{total ink used to print the graphic }}\)
Limits to this approach: we still need some ink to interpret and understand the data.
library(dplyr)     # for group_by() and count()
library(ggplot2)
library(gridExtra)
library(ggthemes)

# High data-ink ratio graph (plot1)
plot1 <- mtcars |>
  group_by(cyl) |>
  count() |>
  ggplot(aes(x = as.factor(cyl), y = n, fill = as.factor(cyl), label = n)) +
  geom_bar(stat = "identity") +
  geom_label(color = "white", fontface = "bold", show.legend = FALSE) +
  labs(
    title = "Car Cylinder Count with High Data-Ink Ratio",
    subtitle = "Detailed representation with color, labels, and gridlines",
    x = "Number of Cylinders",
    y = "Count of Cars"
  ) +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "lightblue", color = NA),
    panel.grid.major = element_line(color = "gray70", linewidth = 0.5),
    panel.grid.minor = element_line(color = "gray85", linewidth = 0.25),
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray20")
  )

# Minimalist graph (plot2)
plot2 <- mtcars |>
  group_by(cyl) |>
  count() |>
  ggplot(aes(x = as.factor(cyl), y = n)) +
  geom_bar(stat = "identity", fill = "gray50") +
  labs(
    title = "Car Cylinder Count with Minimalist Design",
    x = "Number of Cylinders",
    y = "Count of Cars"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 8)
  )

# Arrange both plots side by side
grid.arrange(plot1, plot2, ncol = 2)
Source: simplexct.com
Works for tables as well…
Data density is the number of data points being graphed relative to the physical size of the graphic; it captures the principle of presenting many numbers in a small space:
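In the same notation as the Lie Factor and Data-ink Ratio above, Tufte defines:

Data Density = \(\frac{\text{number of entries in the data matrix}}{\text{area of the data graphic}}\)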
Data Density Example: Election results

| candidate | votes (%) |
|---|---|
| Trump | 49.8 |
| Harris | 48.3 |
49.8% voted for Trump, 48.3% for Harris.
Two pieces of advice I personally received:
What is wrong with the graph below? Create your own version of the graph.
Source: perceptual edge
Use the following data to create your own version of the graph:
dataChallenge <- data.frame(
  Location = rep(c("Bahamas Beach", "French Riviera", "Hawaiian Club"), each = 3),
  Fiscal_Year = rep(c("FY93", "FY94", "FY95"), times = 3),
  Revenue = c(
    250000, 275000, 350000, # Bahamas Beach (FY93, FY94, FY95)
    260000, 200000, 210000, # French Riviera (FY93, FY94, FY95)
    450000, 500000, 400000  # Hawaiian Club (FY93, FY94, FY95)
  )
)

ggplot(dataChallenge, aes(x = Fiscal_Year, y = Revenue, fill = Location)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(
    title = "Resort Revenues by Location and Year",
    x = "Year",
    y = "Revenue (in USD)"
  ) +
  scale_y_continuous(labels = scales::dollar) +
  scale_x_discrete(labels = c("1993", "1994", "1995")) +
  theme_classic() +
  theme(legend.position = "bottom")

ggplot(dataChallenge, aes(x = Fiscal_Year, y = Revenue, color = Location,
                          group = Location)) +
  geom_line() +
  labs(
    title = "Resort Revenues by Location and Year",
    x = "",
    y = "Revenue (in USD)"
  ) +
  scale_y_continuous(labels = scales::dollar) +
  scale_x_discrete(labels = c("1993", "1994", "1995"),
                   expand = expansion(add = c(0.5, 0.5))) +
  theme_classic() +
  theme(legend.position = "bottom")

ggplot(dataChallenge, aes(x = Fiscal_Year, y = Revenue, color = Location,
                          group = Location)) +
  geom_line(linewidth = 2) +
  geom_text(data = dataChallenge[dataChallenge$Fiscal_Year == "FY95", ],
            aes(label = Location), hjust = 0, nudge_x = 0.1, nudge_y = 0) +
  labs(
    title = "Resort Revenues by Location and Year",
    x = "",
    y = "Revenue (in USD)"
  ) +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 500000)) +
  scale_x_discrete(labels = c("1993", "1994", "1995"),
                   expand = expansion(add = c(0.5, 1))) +
  theme_classic() +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Set1")

Data visualization is an art of story-telling, deception, and scientific exactitude 🤓.
Focus on steps 1-4 for this course.
tidytext: Converts text to/from tidy formats. Works well with tidyverse.
quanteda: Comprehensive package for preprocessing, visualization, and statistical analysis.
The raw material of quantitative text analysis is a corpus. In NLP, a corpus is a collection of authentic texts organized into a dataset.
- In quanteda, a corpus is a data frame with a character vector for the documents and additional metadata columns.
- Text can come in .json format (after web scraping), in .csv format, or in simple .txt files.
- We use the inauguration corpus from quanteda, a standard corpus in introductory text analysis. It contains the inaugural addresses of the first five US presidents.
- Text files can also be read with the readtext package. Here the text is contained in a csv file and is loaded with the read.csv() function. The metadata of this corpus are the year of the inauguration and the name of the president taking office.

Corpus consisting of 5 documents and 3 docvars.
text1 :
"Fellow-Citizens of the Senate and of the House of Representa..."
text2 :
"Fellow citizens, I am again called upon by the voice of my c..."
text3 :
"When it was first perceived, in early times, that no middle ..."
text4 :
"Friends and Fellow Citizens: Called upon to undertake the du..."
text5 :
"Proceeding, fellow citizens, to that qualification which the..."
Year President FirstName
1 1789 Washington George
2 1793 Washington George
3 1797 Adams John
4 1801 Jefferson Thomas
5 1805 Jefferson Thomas
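A corpus like the one above can be built directly from a data frame with quanteda; a minimal sketch with invented toy documents (the column contents are illustrative, not the lecture's csv):

```r
library(quanteda)

# One row per document: a text column plus metadata columns (the docvars)
df <- data.frame(
  text      = c("Fellow-Citizens of the Senate...",
                "Fellow citizens, I am again called..."),
  Year      = c(1789, 1793),
  President = c("Washington", "Washington")
)

corp <- corpus(df, text_field = "text")  # metadata columns become docvars
summary(corp)
```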
Used to detect patterns in strings, replace parts of text, extract information from text.
The stringr package has made regular expressions easier to deal with.
str_count()

# Count mentions of the first person pronoun "I"
str_count(corp, "I") # counts the number of "I" occurrences. This is not what we want.

[1] 30 6 24 23 28
[1] 23 6 13 21 18
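The corrected counts above were presumably produced by matching "I" as a whole word rather than as a letter; a sketch, assuming corp holds the five addresses:

```r
library(stringr)

# \b marks a word boundary, so "I" no longer matches inside
# words such as "Inauguration" or "Indians"
str_count(corp, "\\bI\\b")
```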
# Extract the first five words of each discourse
str_extract(corp, "^(\\S+\\s|[[:punct:]]|\\n){5}")
# ^ anchors the match at the beginning of the string; (...){5} requires the group
# to match five times. Each repetition matches either a run of non-space
# characters (\S+) followed by a space (\s), or a punctuation sign ([[:punct:]]),
# or a paragraph sign (\n). In short: extract the first five space- or
# punctuation-delimited words.

[1] "Fellow-Citizens of the Senate and " "Fellow citizens, I am again "
[3] "When it was first perceived, " "Friends and Fellow Citizens:\n\n"
[5] "Proceeding, fellow citizens, to that "
Tokens: Building blocks of text (words, punctuation, etc.).
[1] "Fellow-Citizens" "of" "the" "Senate" "and"
[6] "of" "the" "House" "of" "Representatives"
[11] "Among" "the" "vicissitudes" "incident" "to"
[16] "life" "no" "event" "could" "have"
[1] "i" "me" "my" "myself" "we" "our" "ours"
[8] "ourselves" "you" "your" "yours" "yourself" "yourselves" "he"
[15] "him" "his" "himself" "she" "her" "hers" "herself"
[22] "it" "its" "itself" "they" "them" "their" "theirs"
[29] "themselves" "what" "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are" "was" "were"
[43] "be" "been" "being" "have" "has" "had" "having"
[50] "do" "does" "did" "doing" "would" "should" "could"
[57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're"
[64] "they're" "i've" "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll" "you'll" "he'll"
[78] "she'll" "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
[85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" "won't"
[92] "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't"
[99] "let's" "that's" "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an" "the" "and"
[113] "but" "if" "or" "because" "as" "until" "while"
[120] "of" "at" "by" "for" "with" "about" "against"
[127] "between" "into" "through" "during" "before" "after" "above"
[134] "below" "to" "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again" "further" "then"
[148] "once" "here" "there" "when" "where" "why" "how"
[155] "all" "any" "both" "each" "few" "more" "most"
[162] "other" "some" "such" "no" "nor" "not" "only"
[169] "own" "same" "so" "than" "too" "very" "will"
[1] "Fellow-Citizens" "Senate" "House" "Representatives" "Among"
[6] "vicissitudes" "incident" "life" "event" "filled"
[11] "greater" "anxieties" "notification" "transmitted" "order"
[16] "received" "14th" "day" "present" "month"
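The sequence shown above (tokenize, inspect the stopword list, remove stopwords) can be sketched as follows, assuming the corp object from earlier and that punctuation was removed at tokenization:

```r
library(quanteda)

# Split each document into tokens, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# The built-in English stopword list printed above
stopwords("en")

# Drop stopwords from the tokens
toks <- tokens_remove(toks, stopwords("en"))
```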
# We can keep words we are interested in
tokens_select(toks, pattern = c("peace", "war", "great*", "unit*"))

Tokens consisting of 5 documents and 4 docvars.
text1 :
[1] "greater" "United" "Great" "United" "united" "great" "great" "united"
text2 :
[1] "united"
text3 :
[1] "war" "great" "United" "great" "great" "peace" "great" "peace" "peace" "United"
[11] "peace" "peace"
[ ... and 2 more ]
text4 :
[1] "greatness" "unite" "unite" "greater" "peace" "peace" "peace" "war"
[9] "peace" "greatest" "greatest" "great"
[ ... and 1 more ]
text5 :
[1] "United" "peace" "great" "war" "war" "War" "peace" "peace" "peace"
# Build n-grams (here bigrams and trigrams, n = 2:3)
toks_ngrams <- tokens_ngrams(toks, n = 2:3)

# Select n-grams based on a structure: keep those that start with "never"
toks_neg_bigram_select <- tokens_select(toks_ngrams, pattern = phrase("never_*"))
head(toks_neg_bigram_select[[1]], 30)

[1] "never_hear" "never_expected" "never_hear_veneration" "never_expected_nation"
Code Example:
Document-feature matrix of: 5 documents, 1,818 features (72.28% sparse) and 4 docvars.
features
docs among vicissitudes incident life event filled greater anxieties notification transmitted
text1 1 1 1 1 2 1 1 1 1 1
text2 0 0 0 0 0 0 0 0 0 0
text3 4 0 0 2 0 0 0 0 0 0
text4 1 0 0 1 0 0 1 0 0 0
text5 7 0 0 2 0 0 0 0 0 0
[ reached max_nfeat ... 1,808 more features ]
Use DTMs for:
Very basic statistics about documents include the top features of each document and the frequency of expressions in the corpus.
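With quanteda, these statistics come straight from the document-feature matrix; a sketch, assuming the toks object built earlier:

```r
library(quanteda)

# Build the document-feature matrix from the tokens
dfmat <- dfm(toks)

# Most frequent features across the corpus
topfeatures(dfmat, n = 10)
```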
The frequency of tokens can be represented in a text plot.
DTMs are still used in business applications to describe text input.